ENVX2001 Applied Statistical Methods
The University of Sydney
Apr 2025
LO1. demonstrate proficiency in designing sample schemes and analysing data from them using R
LO2. describe and identify the basic features of an experimental design; replicate, treatment structure and blocking structure
LO3. demonstrate proficiency in the use of the statistical programming language R to perform an ANOVA and fit regression models to experimental data
LO4. demonstrate proficiency in the use of the statistical programming language R to use multivariate methods to find patterns in data
LO5. interpret the output of regression, ANOVA and multivariate analyses calculated by R, and understand conceptually how it is derived
LO6. write statistical and modelling results as part of a scientific report
LO7. appraise the validity of statistical analyses used in publications.
Adrien-Marie Legendre, Carl Friedrich Gauss, Francis Galton
Note
Many other people contributed to the development of regression analysis, but these three are the most well-known.
An example with Galton’s data: parent and child heights.
Galton, F. (1886). Regression Towards Mediocrity in Hereditary Stature. Journal of the Anthropological Institute, 15, 246–263.
'data.frame': 928 obs. of 2 variables:
$ parent: num 70.5 68.5 65.5 64.5 64 67.5 67.5 67.5 66.5 66.5 ...
$ child : num 61.7 61.7 61.7 61.7 61.7 62.2 62.2 62.2 62.2 62.2 ...
r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}}
All of these data have a correlation coefficient of about 0.8 – always visualise your data.
We want to predict a response Y based on a predictor x for observations i = 1, \dots, n:
Y_i = \color{royalblue}{\beta_0 + \beta_1 x_i} +\color{red}{\epsilon_i}
where
\epsilon_i \sim N(0, \sigma^2)
Because \epsilon_i is the only part of the equation that is not fixed, we associate it with the residual error (observed-predicted). It would also cover other aspects of error (e.g. sampling error, parallax error) but these are hard to discern.
\color{royalblue}{\hat{y}_i} = \beta_0 + \beta_1 x_i
\hat\epsilon_i = y_i - \color{royalblue}{\hat{y}_i}
\hat\epsilon_i = y_i - \color{royalblue}{(\beta_0 + \beta_1 x_i)}
\sum_{i=1}^n \hat\epsilon_i^2 = \sum_{i=1}^n (y_i - \color{royalblue}{(\beta_0 + \beta_1 x_i)})^2
Finding the minimum SS requires solving the following problem:
\color{firebrick}{\operatorname{argmin}_{\beta_0,\, \beta_1}} \sum_{i=1}^n (y_i - \color{royalblue}{(\beta_0 + \beta_1 x_i)})^2
We can find \beta_0 and \beta_1 analytically. We first find \beta_1:
\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{Cov(x,y)}{Var(x)} = \frac{SS_{xy}}{SS_{xx}} And then substitute \beta_1 into the equation for \beta_0:
\beta_0 = \bar{y} - \beta_1 \bar{x}
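These closed-form estimates are easy to verify in R; a minimal sketch, assuming the Galton data frame from the HistData package:

```r
library(HistData)  # provides Galton: 928 parent/child height pairs

x <- Galton$parent
y <- Galton$child

# Slope: Cov(x, y) / Var(x); the (n - 1) denominators cancel,
# so this equals SS_xy / SS_xx
b1 <- cov(x, y) / var(x)

# Intercept: y-bar minus b1 times x-bar
b0 <- mean(y) - b1 * mean(x)

c(b0 = b0, b1 = b1)
# Compare with coef(lm(child ~ parent, data = Galton))
```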
Computers can instead use iterative numerical "guesses" to find the set of parameters that minimises the objective function (SS) – this is computationally efficient and applies beyond linear regression.
In R, all of this is handled by lm() – that's it, the model has been fitted.
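A minimal sketch of the fit, assuming the Galton data frame is loaded (e.g. from the HistData package):

```r
library(HistData)

fit <- lm(child ~ parent, data = Galton)  # child height modelled on parent height
summary(fit)  # coefficients, standard errors, R-squared, F-test
```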
But there is a process similar to HATPC (hypothesis, assumptions, test, p-value, conclusions).
H_0: \beta_1=0
H_1: \beta_1 \neq 0
The null model is a line with no slope (i.e. flat or horizontal) at the mean of the child height (\bar{y} = 68 inches).
library(dplyr)
library(ggplot2)
null_model <- Galton %>%
lm(child ~ 1, data = .) %>%
broom::augment(Galton)
lin_model <- Galton %>%
lm(child ~ parent, data = .) %>%
broom::augment(Galton)
models <- bind_rows(null_model, lin_model) %>%
mutate(model = rep(c("Null model", "SLR model"), each = nrow(Galton)))
ggplot(data = models, aes(x = parent, y = child)) +
geom_smooth(
data = filter(models, model == "Null model"),
method = "lm", se = FALSE, formula = y ~ 1, size = 1
) +
geom_smooth(
data = filter(models, model == "SLR model"),
method = "lm", se = FALSE, formula = y ~ x, size = 1
) +
geom_segment(
aes(xend = parent, yend = .fitted),
arrow = arrow(length = unit(0.1, "cm")),
size = 0.3, color = "darkgray"
) +
geom_point(alpha = .2) +
facet_wrap(~model) +
xlab("Parent height (in)") +
ylab("Child height (in)") +
theme_classic()

The data must meet certain criteria, which we often call assumptions. They can be remembered using LINE:

- Linearity: the relationship between x and y is linear
- Independence: the errors are independent of one another
- Normality: the residuals are normally distributed
- Equal variance: the residuals have constant variance at every value of x
Tip
All but the independence assumption can be assessed using diagnostic plots.
- plot() on the fitted model (base R)
- autoplot() from the ggfortify package
- check_model() from the performance package (also provides a guide on what to check for in each assumption plot)
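A sketch of the three options, assuming the simple linear model is stored as fit:

```r
fit <- lm(child ~ parent, data = Galton)

# Base R: residuals vs fitted, Q-Q, scale-location and leverage plots
par(mfrow = c(2, 2))
plot(fit)

# ggplot2-style versions of the same diagnostics
library(ggfortify)
autoplot(fit)

# Annotated assumption checks, with guidance on what to look for
library(performance)
check_model(fit)
```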
Prior knowledge and visual inspection come into play. Does the relationship look approximately linear?
This assumption is addressed during experimental design, but correlated errors and time-related patterns are still possible, for example when observations are collected sequentially or in groups.
For a given value of x, the residuals should be normally distributed. In a scatterplot of x and y, the points would appear evenly distributed (linear and no fanning).
Heavy-tailed, left-skewed.
Outliers are not a strict assumption, but they will affect the model fit.
Standardised\ residual = \frac{Residual}{Standard\ error\ of\ the\ residual}
How well does our fitted model represent the relationship between the variables?
ANOVA is a variation of linear regression – both partition the total variance into a sum of squares for the model (variance explained) and a sum of squares for the residuals (variance not explained), which are the components of the F-statistic.
- parent Sum Sq: the variation that parent explains in the child variable
- Residuals Mean Sq: the variation (per degree of freedom) that the model does not explain
- The F-value is their ratio, i.e. does parent explain enough variation in child to be considered significant?

\text{F-value} = \frac{\text{parent Mean Sq}}{\text{Residuals Mean Sq}} = \frac{1236.9}{5.01} = 246.84

(With 1 degree of freedom for parent, its Mean Sq equals its Sum Sq.)
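The table above is produced by anova(); a sketch, assuming the model is stored as fit:

```r
fit <- lm(child ~ parent, data = Galton)
tab <- anova(fit)  # Sum Sq, Mean Sq, F value and Pr(>F) for parent and Residuals
tab

# The F-value is the ratio of the two mean squares
tab$`Mean Sq`[1] / tab$`Mean Sq`[2]
```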
The ANOVA suggests that the main effect of parent is statistically significant and large (F(1, 926) = 246.84, p < .001)
We fitted a linear model (estimated using OLS) to predict child with parent (formula: child ~ parent). The model explains a statistically significant and moderate proportion of variance (R2 = 0.21, F(1, 926) = 246.84, p < .001). Within this model, the effect of parent is statistically significant and positive (\beta_1 = 0.65, 95% CI [0.57, 0.73], t(926) = 15.71, p < .001).
Note
For simple linear regression, the significance of the predictor (i.e. parent) is the same as the model significance.
Model fit and predictions
Call:
lm(formula = child ~ parent, data = Galton)
Residuals:
Min 1Q Median 3Q Max
-7.8050 -1.3661 0.0487 1.6339 5.9264
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.94153 2.81088 8.517 <2e-16 ***
parent 0.64629 0.04114 15.711 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.239 on 926 degrees of freedom
Multiple R-squared: 0.2105, Adjusted R-squared: 0.2096
F-statistic: 246.8 on 1 and 926 DF, p-value: < 2.2e-16
\widehat{child} = 23.9 + 0.65 \cdot parent
For every unit increase in parent (i.e. 1 inch), we expect a 0.65 inch increase in child.
How much variation is explained? R2 = 0.21 = 21%
What is the predicted child height for a parent height of 70 inches?
We use predict() to make predictions – it takes in the lm() model, recreates the equation and applies it to new data.
1
69.18187
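A sketch of the call that produces this prediction, plus the equivalent manual calculation from the coefficients (assuming the model is stored as fit):

```r
fit <- lm(child ~ parent, data = Galton)

# predict() rebuilds the fitted equation and applies it to new data
predict(fit, newdata = data.frame(parent = 70))

# Manual check: beta0 + beta1 * 70
coef(fit)[1] + coef(fit)[2] * 70  # 23.94 + 0.646 * 70, approximately 69.18
```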
Note
How good is our prediction actually? What if we had more parents and children, would the equation still hold up? We cover this in Week 9.
What if assumptions are not met, or we want to improve the model?
Note
We can also perform transformations to improve the model fit, but beware of overfitting – we want to make reasonable predictions, not fit the data!
Daily air quality measurements in New York, May to September 1973.
'data.frame': 153 obs. of 6 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
Is a simple linear model appropriate? Depends on your threshold for what is acceptable.
A log transformation (natural log or another base) is relatively easy to back-transform.
\widehat{log(Ozone)}=\color{royalblue}{-1.8380 + 0.0675 \times Temp} \widehat{Ozone}=e^{-1.8380 + 0.0675 \times Temp}=e^{-1.8380} \times e^{0.0675 \times Temp} But given we are focused on a 1-unit change of Temp, \widehat{Ozone} changes by e^{0.0675} = 1.07 times.
If this had been a sqrt() transformation…
\widehat{\sqrt{Ozone}}=-1.8380 + 0.0675 \times Temp

\widehat{Ozone}=(-1.8380 + 0.0675 \times Temp)^2 = 3.3782 - (0.2481 \times Temp) + (0.0675 \times Temp)^2
Interpreting as a percent change can be more meaningful - it can be done with any log transformation (substitute e below for 10 or any other base), but the quick approximation only works with natural log transformations.
If y has been transformed with a natural log (log(y)), for a one-unit increase in x the percent change in y (not log(y)) is calculated with:
\Delta y \% = 100 \cdot (e^{\beta_1}-1)
If \beta_1 is small (i.e. -0.25 < \beta_1 < 0.25), then: e^{\beta_1} \approx 1 + \beta_1. So \Delta y \% \approx 100 \cdot \beta_1.
| \beta_1 | Exact: 100 \cdot (e^{\beta_1} - 1) % | Approximate: 100 \cdot \beta_1 % |
|---|---|---|
| -0.25 | -22.13 | -25 |
| -0.1 | -9.52 | -10 |
| 0.01 | 1.01 | 1 |
| 0.1 | 10.52 | 10 |
| 0.25 | 28.41 | 25 |
| 0.5 | 64.87 | 50 |
| 2 | 638.91 | 200 |
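The table can be reproduced in a few lines of R:

```r
beta <- c(-0.25, -0.1, 0.01, 0.1, 0.25, 0.5, 2)

exact  <- 100 * (exp(beta) - 1)  # exact percent change in y
approx <- 100 * beta             # quick approximation for small beta

round(data.frame(beta, exact, approx), 2)
# The approximation degrades as |beta| grows, e.g. beta = 2 gives 638.91 vs 200
```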
Let’s transform Ozone using the natural log (log()).
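The two summaries that follow come from fitting the raw and log-transformed models:

```r
# airquality ships with base R; rows with missing Ozone are dropped by lm()
fit_raw <- lm(Ozone ~ Temp, data = airquality)       # untransformed response
fit_log <- lm(log(Ozone) ~ Temp, data = airquality)  # natural-log response

summary(fit_raw)
summary(fit_log)
```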
Call:
lm(formula = Ozone ~ Temp, data = airquality)
Residuals:
Min 1Q Median 3Q Max
-40.729 -17.409 -0.587 11.306 118.271
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -146.9955 18.2872 -8.038 9.37e-13 ***
Temp 2.4287 0.2331 10.418 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 23.71 on 114 degrees of freedom
(37 observations deleted due to missingness)
Multiple R-squared: 0.4877, Adjusted R-squared: 0.4832
F-statistic: 108.5 on 1 and 114 DF, p-value: < 2.2e-16
Call:
lm(formula = log(Ozone) ~ Temp, data = airquality)
Residuals:
Min 1Q Median 3Q Max
-2.14469 -0.33095 0.02961 0.36507 1.49421
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.83797 0.45100 -4.075 8.53e-05 ***
Temp 0.06750 0.00575 11.741 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5848 on 114 degrees of freedom
(37 observations deleted due to missingness)
Multiple R-squared: 0.5473, Adjusted R-squared: 0.5434
F-statistic: 137.8 on 1 and 114 DF, p-value: < 2.2e-16
The transformed model equation is:
\widehat{log(Ozone)}=\color{royalblue}{-1.8380 + 0.0675 \times Temp} A 1 degree (°F) increase in temperature is associated with a:
- 0.0675 unit increase in log(Ozone) concentration
- e^{0.0675} = 1.07 times increase in Ozone concentration
- approximately a 7% increase in Ozone concentration

Multiple linear regression extends the model to k predictors:

Y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_kx_k + \epsilon
where Y is the response, x_1, \dots, x_k are the predictors, \beta_0, \dots, \beta_k are the coefficients to estimate, and \epsilon \sim N(0, \sigma^2) is the error term.
Can we improve the current model by adding wind and solar radiation as additional predictors?
log(Ozone)_i = \beta_0 + \beta_1Temp_i + \epsilon_i

log(Ozone)_i = \beta_0 + \beta_1Temp_i + \color{royalblue}{\beta_2Solar.R_i + \beta_3Wind_i} + \epsilon_i
There is one additional assumption for multiple linear regression. Collinearity is when two or more predictors are very highly correlated. If the predictors are basically identical, the model cannot distinguish how much variability each explains. (Correlations in previous slides look fine).
For multiple linear regression, there are two hypothesis tests:
For each individual predictor (t-test): H_0: \beta_k = 0 \quad H_1: \beta_k \neq 0

For the overall model (F-test): H_0: \beta_1 = \beta_2 = ... = \beta_k = 0 \quad H_1: \text{At least one } \beta_k \neq 0
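The summary that follows comes from fitting the three-predictor model:

```r
fit_mlr <- lm(log(Ozone) ~ Temp + Solar.R + Wind, data = airquality)
summary(fit_mlr)  # t-test per coefficient, plus the overall F-test
```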
Call:
lm(formula = log(Ozone) ~ Temp + Solar.R + Wind, data = airquality)
Residuals:
Min 1Q Median 3Q Max
-2.06193 -0.29970 -0.00231 0.30756 1.23578
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.2621323 0.5535669 -0.474 0.636798
Temp 0.0491711 0.0060875 8.077 1.07e-12 ***
Solar.R 0.0025152 0.0005567 4.518 1.62e-05 ***
Wind -0.0615625 0.0157130 -3.918 0.000158 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5086 on 107 degrees of freedom
(42 observations deleted due to missingness)
Multiple R-squared: 0.6644, Adjusted R-squared: 0.655
F-statistic: 70.62 on 3 and 107 DF, p-value: < 2.2e-16
Model equation:
\widehat{log(Ozone)}=-0.262 + 0.0492 \cdot Temp + 0.00252 \cdot Solar.R - 0.0616 \cdot Wind
\widehat{log(Ozone)}=-0.262 + 0.0492 \cdot Temp + 0.00252 \cdot Solar.R - 0.0616 \cdot Wind
Holding all other variables constant:
- A 1 unit increase in Temp is associated with a 4.9% increase in Ozone concentration.
- A 1 unit increase in Solar.R is associated with a 0.25% increase in Ozone concentration.
- A 1 unit increase in Wind is associated with a 6.2% decrease in Ozone concentration.

Extracting the model equation into LaTeX can be automated with extract_eq() from the equatiomatic package:
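A sketch of the call, assuming the multiple regression is stored as fit_mlr:

```r
library(equatiomatic)

fit_mlr <- lm(log(Ozone) ~ Temp + Solar.R + Wind, data = airquality)

extract_eq(fit_mlr)                   # symbolic LaTeX equation
extract_eq(fit_mlr, use_coefs = TRUE) # with fitted coefficients substituted
```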
Comparing the two models (both with log(Ozone) as the response):

| Predictors | SLR Estimates | p | MLR Estimates | p |
|---|---|---|---|---|
| (Intercept) | -1.8380 | <0.001 | -0.2621 | 0.637 |
| Temp | 0.0675 | <0.001 | 0.0492 | <0.001 |
| Solar.R | | | 0.0025 | <0.001 |
| Wind | | | -0.0616 | <0.001 |
| Observations | 116 | | 111 | |
| R2 / R2 adjusted | 0.547 / 0.543 | | 0.664 / 0.655 | |
We will discuss how to select the best subset of predictors for a model.
Questions? Comments?
Slides made with Quarto